Introduction

This report will analyse bulk RNA sequencing data from the GEO database, accession number: GSE213001, which looks at the lung tissue of people affected and unaffected by idiopathic pulmonary fibrosis (IPF). This data is associated with an article by Jaffar et al. published in 2022 (Jaffar et al., 2022) which investigates a biomarker of IPF disease development.

Obtaining and assessing the data

The data was obtained using the GEOquery Bioconductor package (Davis & Meltzer, 2007). Samples were collected from 20 diseased patients and 14 non-diseased controls. A total of 139 samples were collected. Summary statistics can be seen below for samples per condition (disease/normal) and samples per pateint.

Table 1. Distribution of diseased vs healthy tissue samples
Tissue type Frequency
Disease 98
Normal 41
Table 2. Distribution of samples per patient
Samples per patient Number of patients
1 2
2 20
3 1
4 21
5 2

Figure 1. Library size of all 139 samples. The mean library size is 13.8 million reads.

This dataset contains 15,065 genes and 139 samples. There are no duplicate genes or samples.

Mapping Ensembl IDs to HGNC symbols

Using Ensembl (Dyer et al., 2025), Ensembl IDs found in the expression data were mapped to HGNC symbols, which are more readable and useful for downstream processing.

Table 3. Preview of expression data with mapped HGNC symbols
ensembl_gene_id hgnc_symbol ALF001 ALF004C ALF004D
ENSG00000000003 TSPAN6 426 340 376
ENSG00000000419 DPM1 410 289 379
ENSG00000000457 SCYL3 226 195 272
ENSG00000000460 FIRRM 56 58 92
ENSG00000000938 FGR 726 1889 1979
ENSG00000000971 CFH 7922 5938 6026

Genes that did not map to a HUGO gene symbol were retained under their Ensembl ID, as they may still show differential expression. There were no duplicate genes, though if there had been, average expression could have be taken.

Normalization

First, we would like to get an overview of the data by making a boxplot and a density plot of read counts. Counts per million were calculated using EdgeR (Robinson, McCarthy, & Smyth, 2010).

Figure 2. Box plot of counts per million (CPM) of all 139 samples CPM calculated using EdgeR (Robinson et al., 2010). Code adapted from BCB420 Lecture 5.

Some samples have a mean CPM clearly lower than the rest, but with so many samples it is difficult to compare them all. A density plot can help with this visualization.

Figure 3. Density plot of counts per million (CPM) for all 139 samples. CPM calculated using EdgeR (Robinson et al., 2010). Code adapted from BCB420 Lecture 5.

As expected, there is somewhat of a peak near zero counts as many genes are expressed in low quantity. One sample (orange) stands out from the rest, with the CPM curve shifted to the left.

Filter out genes with low expression

Next, EdgeR was used to filter out genes with low expression levels (counts < 3).

Figure 4. Box plot of counts per million of all 139 samples after removal of genes which are lowly expressed (< 3 counts).

Again, a density plot was made for visualization, and the drop in low counts is clear:

Figure 5. Density plot of counts per million after removal of genes which are lowly expressed (< 3 counts).

Using a design matrix

It is important to use a design matrix as it allows filtering to be done in a way that considers our conditions (normal vs diseased tissue). Without one, genes that are differentially expressed between conditions, could be filtered out if they have low counts in one of the two conditions.

Information on whether the sample is from a diseased on healthy person is found in the metadata, and this will be used to make the design matrix.

Figure 6. Box plot of counts per million of all 139 samples after removal of lowly expressed genes and using a design matrix

Once again, a density plot allows us to visualise the change in distribution:

Figure 7. Density of counts per million of all 139 samples after removal of lowly expressed genes and using a design matrix

Compared to the initial plot, we can see that the number of genes with low counts has decreased. We should expect a similar result after the following normalization steps.

Normalization using Trimmed Mean of M-values (TMM)

TMM is a sample-based normalization method, based off of the idea that most genes aren’t differentially expressed. It compares expression to a reference sample and is commonly used with RNA-seq data. It finds the log expression ratio (M value) for all genes, comparing to the reference, trims data with extreme M and A (expression) values, and calculates a weighted mean of the remaining ratios (Robinson & Oshlack, 2010).

EdgeR is used to create the DGEList object, which will serve as input for future steps (Robinson et al., 2010).

Figure 8. Box plot of Trimmed Mean of M-values (TMM)-normalized gene expression for all 139 samples. The default minimum count of 10 is used.

After TMM normalization, there are no obvious outliers, and the means are more closely aligned than after manual cleaning (Figure 6).

Figure 9. Box plot of Trimmed Mean of M-values (TMM)-normalized gene expression for all 139 samples. The default minimum count of 10 is used.

Compared to the original data (Figure 3), we can see that low counts have been removed and the curve is overall more compact.

MDS Plot

Next, we represented the data in an a multidimensional scaling (MDS) plot to visualize the similarities and differences between samples.

Figure 10. Multidimensional scaling (MDS) plot of TMM normalized data Red circles represent healthy donor samples, and black triangles represent diseased donor samples. Points are positioned based on similarity.

By inspection, there is not a clear separation between diseased and normal tissues. With all points being quite spread out, there are no obvious outliers.

Differential expression analysis

First, a biological coefficient of varience (BCV) plot was created to show how gene expression varies between samples. This plot can be used to verify the dispersion of genes (BCV is the square root of dispersion).

Figure 11. Biological coefficient of variance (BCV) plot Common BCV plotted in red and trend BCV plotted in blue.

Common BCV is approximately 50%, while the trend BCV, which accounts for expression level, decreases from ~70% in less expressed genes to ~40% in higher expressed genes. This decrease is expected as genes with low count tend to contain more noise. Overall, the points remain relatively close to the trendline, which indicates successful dispersion estimation.

DE analysis with limma

Some points have very high dispersion (maximum of 7.55), which could be due to each of the 139 samples being treated independently. Of the three most widely used differential expression analysis tools (limma, EdgeR and DESeq2), which perform similarly (Liu et al., 2021), limma (Ritchie et al., 2015) has the advantage of the duplicateCorrelation function, which models within-patient correlation (takes into account multiple samples coming from one individual). This is important as many patients contributed multiple samples to this data set, and treating them as indepent could inflate variance measurements. Voom was used to transform data from EdgeR so that it can be used by limma (Law, Chen, Shi, & Smyth, 2014).

Differential expression results

The consensus correlation is 0.322, which indicates that there is a moderate correlation between samples from a single patient. A threshold of 0.05 was chosen for both p-value and FDR/multiple hypothesis correction as this is the standard cutoff used. After differential expression analysis, 8,192 genes pass the p < 0.05 threshold and 7,182 pass FDR correction (at most 5% of significant genes are expected to be false positives) out of the total 15,065 genes. Next, we can look at the top genes which are the most differentially expressed.

kable(head(results), caption = "Table 4. Preview of top differentially expressed genes", digits = 50, col.names = c("HGNC symbol", "logFC", "Average Expression", "p-value", "FDR corrected p-value"))
Table 4. Preview of top differentially expressed genes
HGNC symbol logFC Average Expression p-value FDR corrected p-value
LDLRAD4 1.295029 5.336294 1.743361e-31 2.618354e-27
NECAB1 -3.453363 0.965507 2.762368e-29 2.074400e-25
LTBP1 1.690025 7.753609 3.765045e-27 1.884907e-23
TMEM97 -1.669245 3.762740 6.396247e-27 2.401631e-23
CFH 1.825359 8.729778 2.564183e-26 7.702294e-23
ACSS2 -1.206457 6.137581 2.594291e-25 6.493942e-22

Volcano plot of EdgeR results

Figure 12. Volcano plot representing the differential expression of genes in IPF. Red points represent genes which are significantly upregulated in diseased tissue (FDR < 0.05, logFC > 0), blue points represent genes that are significantly downregulated in diseased tissue (FDR < 0.05, logFC < 0), and grey points represent genes that are not significantly differentially expressed.

Table 5. Preview of most differentially expressed genes by FDR-corrected p-value
Gene logFC FDR-corrected p-value
LDLRAD4 1.295029 2.618354e-27
NECAB1 -3.453363 2.074400e-25
LTBP1 1.690025 1.884907e-23
TMEM97 -1.669245 2.401631e-23
CFH 1.825359 7.702294e-23
ACSS2 -1.206457 6.493942e-22

Heatmap

Finally, a heatmap was used to visualize gene expression in diseased tissues. Genes that are upregulated or downregulated in diseased tissues could be important to understanding IPF development.

Figure 13. Heatmap of top 20 differentially expressed genes based on FDR-corrected p-value. Comparing diseased tissues to normal tissues. Rows represent the top 20 genes and columns represent the 139 samples.Red represents positive z-scores (upregulation), and blue represents negative z-scores (downregulation).

There is clear clustering observed where genes are upregulated in one condition (diseased/normal tissue) and downregulated in the other.

Questions:

  1. Why is the dataset of interest to you?
    I hadn’t heard of Idiopathic pulmonary fibrosis (IPF) before, and the topic interested me as it was different to research I had done previously (cancer/microbiome research). I read the paper and appreciated its brevity and straightforwardness, and the potential offered by the large number of samples.
  2. What are the control and test conditions of the dataset?
    Test conditions
  3. How many samples in each of the conditions of your dataset?
    Test conditions
  4. Were there expression values that were not unique for specific genes? How did you handle these?
    No. Test conditions
  5. Were there expression values that could not be mapped to current HUGO symbols?
    Yes. Mapping Ensembl IDs to HGNC symbols
  6. Were there any outliers in your dataset? How were they handled in the originating paper? How many outliers were removed?
    No clear outliers were identified based on visual inspection of boxplots and MDS plot. Normalization using Trimmed Mean of M-values (TMM) and MDS Plot
  7. How did you handle replicates?
    DE analysis with limma
  8. What is the final coverage of your dataset?
    Differential expression results

References

Davis, S., & Meltzer, P. S. (2007). GEOquery: A bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23(14), 1846–1847. https://doi.org/10.1093/bioinformatics/btm254
Dyer, S. C., Austine-Orimoloye, O., Azov, A. G., Barba, M., Barnes, I., Barrera-Enriquez, V. P., … Yates, A. D. (2025). Ensembl 2025. Nucleic Acids Research, 53(D1), D948–D957. https://doi.org/10.1093/nar/gkae1071
Jaffar, J., Wong, M., Fishbein, G. A., Alhamdoosh, M., McMillan, L., Gamell-Fulla, C., … Westall, G. (2022). Matrix metalloproteinase-7 is increased in lung bases but not apices in idiopathic pulmonary fibrosis. ERJ Open Research, 8(4), 00191–02022. https://doi.org/10.1183/23120541.00191-2022
Law, C. W., Chen, Y., Shi, W., & Smyth, G. K. (2014). Voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2), R29. https://doi.org/10.1186/gb-2014-15-2-r29
Liu, S., Wang, Z., Zhu, R., Wang, F., Cheng, Y., & Liu, Y. (2021). Three Differential Expression Analysis Methods for RNA Sequencing: Limma, EdgeR, DESeq2. Journal of Visualized Experiments (JoVE), (175), e62528. https://doi.org/10.3791/62528
Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47. https://doi.org/10.1093/nar/gkv007
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25. https://doi.org/10.1186/gb-2010-11-3-r25